Analyzing Neurotransmitter Receptors & Protein Sequences

Ákos Kimpián, Joshua Lembeck, Elvin Kalinowski, Mikel Garcia Amez, Marcel Skumantz

2025-02-12

Introduction


Data Set info

Channels

Material & Methods

  • Dirtying and Cleaning
  • EDA
  • PCA
  • Prediction

Dirtying and Cleaning

More data

EDA

Prediction preprocessing

  • Features: amino acid (AA) composition variables (aa_*)
  • Target: receptor class, derived by pattern-based annotation of Protein_Name
    • Cys-loop receptors
    • Ionotropic glutamate receptors
    • Other ionotropic receptors
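
The two preprocessing steps above can be sketched in base R. This is a minimal illustration, not the project's actual code: the helpers `aa_composition` and `annotate_receptor_class`, the toy sequences, and the regular expressions are all hypothetical. Only the `aa_*` prefix, the column names `Protein_ID`, `Protein_Name` and `Receptor_class`, and the three class labels are taken from the slides.

```r
# Standard 20 amino acids, one-letter codes
amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Hypothetical helper: relative frequency of each amino acid in one sequence.
# Non-standard letters (e.g. X) are ignored.
aa_composition <- function(sequence) {
  residues <- strsplit(toupper(sequence), "")[[1]]
  counts <- table(factor(residues, levels = amino_acids))
  freqs <- as.numeric(counts) / sum(counts)
  names(freqs) <- paste0("aa_", amino_acids)
  freqs
}

# Hypothetical helper: map a protein name to a receptor class by regex.
# The patterns are illustrative assumptions, not those used in the analysis.
annotate_receptor_class <- function(protein_name) {
  name <- tolower(protein_name)
  cys_loop <- grepl("acetylcholine|gaba|glycine receptor|5-hydroxytryptamine", name)
  glutamate <- grepl("glutamate", name)
  ifelse(cys_loop, "Cys-loop receptors",
         ifelse(glutamate, "Ionotropic glutamate receptors",
                "Other ionotropic receptors"))
}

# Toy input (made-up sequences, far shorter than real receptors)
toy <- data.frame(
  Protein_ID = c("P1", "P2", "P3"),
  Protein_Name = c("Glycine receptor subunit alpha-1",
                   "Glutamate receptor ionotropic, NMDA 1",
                   "P2X purinoceptor 4"),
  Sequence = c("MGGPPVW", "MSTVVLLA", "MAGCCSA"),
  stringsAsFactors = FALSE
)

# One row per protein, one aa_* column per amino acid
features <- t(sapply(toy$Sequence, aa_composition, USE.NAMES = FALSE))

prediction_df <- data.frame(
  Protein_ID = toy$Protein_ID,
  Receptor_class = factor(annotate_receptor_class(toy$Protein_Name)),
  features
)
```

Ordering the patterns from most to least specific matters here: a name is assigned to the first class whose pattern matches, and everything unmatched falls into the "other" class.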

PCA

  • Examine the structure of the feature space before modeling
  • Implemented with tidymodels
  • Scatter plot of PC1 vs. PC2
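#...
# Protein_ID and Receptor_class are kept as identifiers, not predictors
pca_rec <- recipe(~., data = prediction_df) |>
  update_role(Protein_ID, Receptor_class,
              new_role = "id") |>
  # Center and scale the aa_* features, then compute principal components
  step_normalize(all_predictors()) |>
  step_pca(all_predictors())
#...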
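
The same idea (center and scale the features, then project onto principal components) can be sketched with base R's `prcomp`. Note that this swaps in `prcomp` for the recipe steps above and runs on simulated data; `toy_df` and all of its values are assumptions made only for illustration.

```r
set.seed(1)

# Simulated stand-in for prediction_df: 30 proteins, 5 composition-like features
toy_df <- data.frame(
  Protein_ID = sprintf("P%02d", 1:30),
  Receptor_class = factor(rep(c("Cys-loop receptors",
                                "Ionotropic glutamate receptors",
                                "Other ionotropic receptors"), each = 10)),
  matrix(runif(30 * 5), nrow = 30,
         dimnames = list(NULL, paste0("aa_", c("A", "G", "P", "S", "V"))))
)

# Keep only the feature columns; ID and class play no part in the PCA
feature_cols <- grep("^aa_", names(toy_df), value = TRUE)

# center/scale corresponds to step_normalize(); prcomp to step_pca()
pca_fit <- prcomp(toy_df[, feature_cols], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)

# PC1 vs. PC2 scatter, colored by receptor class
plot(pca_fit$x[, "PC1"], pca_fit$x[, "PC2"],
     col = as.integer(toy_df$Receptor_class), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Because the class label is only used for coloring, any separation visible in the scatter reflects structure in the composition features themselves.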

Predictive Modeling

  • Stratified 80/20 train-test split to preserve class proportions in both sets
  • Random Forest classifier with 1000 trees
  • Basic metrics and Mean Decrease Gini (MDG) as feature importance
#...
# Preprocessing recipe: ID and class are not predictors; normalize the rest
pca_rec <- recipe(~., data = prediction_df) |>
  update_role(Protein_ID, Receptor_class,
              new_role = "id") |>
  step_normalize(all_predictors())

# Random forest classifier with 1000 trees
rf_spec <- rand_forest(trees = 1000) |>
  set_engine("randomForest") |>
  set_mode("classification")
#...
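
To make the stratified split concrete, here is a minimal base R sketch on simulated data. It is not the project's code: `toy_df`, its class counts, and its feature values are made up. In tidymodels the split itself would typically be done with `rsample::initial_split(prediction_df, prop = 0.8, strata = Receptor_class)`. The random forest part only runs if the randomForest package is installed.

```r
set.seed(42)

# Simulated, imbalanced stand-in for prediction_df (all values are made up)
classes <- rep(c("Cys-loop receptors",
                 "Ionotropic glutamate receptors",
                 "Other ionotropic receptors"),
               times = c(50, 30, 20))
toy_df <- data.frame(
  Receptor_class = factor(classes),
  aa_G = runif(100), aa_P = runif(100), aa_V = runif(100)
)

# Stratified 80/20 split: sample 80% of the row indices within each class
rows_by_class <- split(seq_len(nrow(toy_df)), toy_df$Receptor_class)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)
train_set <- toy_df[train_idx, ]
test_set <- toy_df[-train_idx, ]

# Class proportions in each partition
print(prop.table(table(train_set$Receptor_class)))
print(prop.table(table(test_set$Receptor_class)))

# Random forest with 1000 trees, only if the package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_fit <- randomForest::randomForest(Receptor_class ~ ., data = train_set,
                                       ntree = 1000)
  # The "MeanDecreaseGini" column holds the MDG importance of each feature
  print(randomForest::importance(rf_fit))
  # Test-set accuracy (meaningless here, since the features are random noise)
  print(mean(predict(rf_fit, newdata = test_set) == test_set$Receptor_class))
}
```

Sampling within each class keeps the rare class present in the test set, which a plain random split does not guarantee for small classes.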

Results

Correlation Matrix


PCA


Results of receptor family classification

Length and weight correlation analysis

Discussion (Joshua)

Biological Interpretation

  • Receptor classification

    • Valine, Glycine, Tryptophan, Serine, and Proline are the most discriminative amino acids for receptor prediction

    • Proline and Glycine are alpha-helix and beta-sheet breakers

    • Proline is highly relevant for loops and turns

Limitations and Future Directions

  • Success of the analysis?
    • Prediction of Cys-loop and ionotropic glutamate receptors is very accurate
    • Other ionotropic receptors are poorly predicted, likely because they are underrepresented in the data set
  • What could be explored in more detail?
    • Test the model against a larger variety of background proteins